About this report

This report presents the actual results from Penzel & Cho et al. (2025), based on the real PRONIA data (which are not publicly available). To follow the analysis steps, we provide a simulated dataset in this GitHub repository: https://github.com/npenzel/psychosis_dwi_disorder_risk. You can run the simulation using pronia_psychosis_risk_dwi_simulation.Rmd. For the code to run properly, make sure your working directory is set to the downloaded repository folder: setwd("/path_to_your_directory/psychosis_dwi_disorder_risk/")

For details on methods, reasoning, and background, please refer to the manuscript.

1 The Sample

1.1 Demographics

The table below shows the sociodemographic and clinical variables of the original PRONIA sample.
variables  HC (healthy controls)  ROP (recent-onset psychosis)  CHR (clinical high-risk)
Total 298 206 183
Female (%) 178 (59.73) 91 (44.17) 95 (51.91)
1st degree relative (%) NA (NA) 32 (15.53) 36 (19.67)
Age (mean [SD]) 25.39 [6.07] 25.87 [5.83] 23.89 [5.67]
body-mass index (mean [SD]) 22.8 [3.58] 23.01 [5.12] 23.04 [4.25]
GAF symptoms: Life (mean [SD]) 88.41 [6.92] 79.16 [9.75] 79.27 [8.64]
GAF symptoms: Past Month (mean [SD]) 86.98 [7.61] 39.21 [14.67] 51.42 [10.45]
GAF disability: Life (mean [SD]) 87.07 [6.66] 78.7 [9.21] 79.2 [7.49]
GAF disability: Past Month (mean [SD]) 85.67 [7.13] 43.08 [13.59] 51.73 [11.59]
PANSS-Positive (mean [SD]) NaN [NA] 19.4 [5.86] 11.73 [3.52]
PANSS-Negative (mean [SD]) NaN [NA] 15.65 [7.35] 13.66 [6.19]
PANSS-General (mean [SD]) NaN [NA] 34.33 [10.45] 29.09 [7.33]
PANSS-Total (mean [SD]) NaN [NA] 69.43 [19.44] 54.49 [13.68]
GF:Role current (mean [SD]) 8.55 [0.69] 4.91 [1.89] 5.65 [1.71]
GF:Role lifetime (mean [SD]) 8.69 [0.67] 7.83 [1] 7.99 [0.88]
GF:Social current (mean [SD]) 8.48 [0.79] 5.64 [1.55] 6.28 [1.29]
GF:Social lifetime (mean [SD]) 8.74 [0.69] 7.84 [0.92] 7.95 [0.79]
Exposome-score (mean [SD]) 1.37 [0.79] 2.39 [1.14] 2.39 [0.99]
SIPS-P (mean [SD]) 0.27 [0.83] 15.67 [5.22] 8.12 [4.69]
SIPS-N (mean [SD]) 0.21 [0.86] 10.78 [7.46] 10.36 [6.45]
SIPS-G (mean [SD]) 0.29 [0.79] 7.34 [4.32] 8.14 [4.01]
SIPS-D (mean [SD]) 0.11 [0.44] 5.21 [4.1] 3.47 [2.99]
BDI sum (mean [SD]) 3.41 [5.19] 21.64 [12.28] 29.12 [11.7]
CPZE cumulative lifetime (mean [SD]) 0 [0] 8344.83 [17643.85] 1471.22 [7563.02]
WAIS-V (mean [SD]) 11.69 [2.65] 9.53 [3.42] 11.05 [3.27]
WAIS-MR (mean [SD]) 11.42 [2.15] 10.01 [2.68] 11.01 [2.38]
SPI-A (mean [SD]) 0.07 [0.33] 7.01 [7.3] 6.22 [5.18]
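The "mean [SD]" and "N (%)" entries above can be assembled per group with dplyr. A minimal sketch, assuming `tbss_final_fat` contains the columns `Studygroup`, `sex`, and `age` (the exact table-building code is part of the Rmd, so this is illustrative only):

```r
library(dplyr)

tbss_final_fat %>%
  filter(Studygroup %in% c('HC', 'ROP', 'CHR')) %>%
  group_by(Studygroup) %>%
  summarise(
    Total  = n(),
    # count and percentage of female participants
    Female = sprintf('%d (%.2f)', sum(sex == 'female'),
                     100 * mean(sex == 'female')),
    # mean [SD] formatting as used in the table above
    Age    = sprintf('%.2f [%.2f]', mean(age, na.rm = TRUE),
                     sd(age, na.rm = TRUE))
  )
```

The same `summarise()` pattern extends to the remaining scales (GAF, PANSS, SIPS, etc.).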

2 Data investigation

Here, we investigate the data in terms of normality, distribution and association with age and sex.

2.1 Average FAt across 3 groups

2.1.1 Figure Average FAt

2.1.2 Code

# Average FAt for the three groups
tbss_final_fat %>%
  filter(Studygroup %in% c('CHR', 'HC', 'ROP'))%>%
  # Create a ggplot with 'Studygroup' on the x-axis and 'AverageFAt' on the y-axis
  ggplot(aes(x = Studygroup, y = AverageFAt, fill = Studygroup))+
  # Add a half-eye plot from the ggdist package for visualizing the distribution
  ggdist::stat_halfeye(
    adjust = 0.5,                   # Adjust the smoothness of the half-eye plot
    justification = -0.2,           # Position the plot to the left
    .width = 0,                     # Suppress the interval slab
    point_colour = NA)+             # Remove points for clarity
  # Add a boxplot for the data with customizations
  geom_boxplot(width = .12,                  # Narrow box width
               outlier.color = 'black',      # Color outliers in black 
               alpha = 0.5)+                 # Make boxplot semi-transparent
  # Customize the axis titles and legend position
  theme(axis.title.x = element_blank(),
        legend.position = 'bottom')+
  scale_x_discrete(limits = c('HC','CHR','ROP'),
                   labels = c("CHR" = "clinical\nhigh-risk\nN=183",
                              "ROP" = "recent-onset\npsychosis\nN=206",
                              'HC' = 'healthy\ncontrols\nN=298'))+
  scale_fill_manual(values = c('ROP' = '#e69c9c',
                               'CHR' = '#8ba6b1',
                               'HC' =  '#98c692'))+
  ylab('Average fractional anisotropy - tissue across\n25 ENIGMA regions of interest')+
  theme_bw()+
  theme(legend.position = 'none',
        axis.title.y = element_text(size = 15),
        axis.text = element_text(size = 13))

2.2 Assumptions: Normality

Although our datasets have a sample size that supports parametric tests, we still assess the assumptions of normality by examining the distribution and QQ-plots.
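Alongside the visual checks below, normality can also be assessed formally. A minimal sketch running a Shapiro-Wilk test per study group (assuming the `tbss_final_fat` data frame used throughout this report; this test is not part of the original analysis):

```r
library(dplyr)
library(purrr)
library(broom)

tbss_final_fat %>%
  filter(Studygroup %in% c('CHR', 'HC', 'ROP')) %>%
  # nest the data per study group
  group_nest(Studygroup) %>%
  # Shapiro-Wilk test of AverageFAt within each group
  mutate(shapiro = map(data, ~ tidy(shapiro.test(.x$AverageFAt)))) %>%
  dplyr::select(-data) %>%
  unnest(shapiro)
```

Note that with samples of this size the Shapiro-Wilk test is sensitive to trivial deviations, which is why the visual inspection below is the primary check.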

2.2.1 Figure Distribution

2.2.2 Code Distribution

tbss_final_fat %>%
  filter(Studygroup %in% c('CHR', 'HC', 'ROP'))%>%
  ggplot(aes(x = AverageFAt, color = Studygroup, fill = Studygroup))+
  geom_histogram(aes(y = ..density..))+
  scale_fill_manual(values = c('ROP' = '#e69c9c',
                               'CHR' = '#8ba6b1',
                               'HC' =  '#98c692'))+
  scale_color_manual(values = c('ROP' = '#e69c9c',
                               'CHR' = '#8ba6b1',
                               'HC' =  '#98c692'),
                     labels = c("CHR" = "clinical\nhigh-risk\nN=183",
                               "ROP" = "recent-onset\npsychosis\nN=206",
                               'HC' = 'healthy\ncontrols\nN=298'))+
  theme_bw()+
  theme(legend.position = 'none',
        axis.title.y = element_text(size = 15),
        axis.text = element_text(size = 13))+
  stat_function(fun = dnorm, colour = 'black',
                args = list(mean = mean(tbss_final_fat$AverageFAt), 
                            sd = sd(tbss_final_fat$AverageFAt)), size = 1)+
  facet_wrap(.~Studygroup, ncol = 1)

2.2.3 Figure Normality: QQ-plot

2.2.4 Code Normality: QQ-plot

tbss_final_fat %>%
  filter(Studygroup %in% c('ROP', 'CHR', 'HC'))%>%
  ggplot(., aes(sample = AverageFAt))+ 
  stat_qq()+
  facet_wrap(.~Studygroup, ncol = 1)+
  theme_bw()

2.3 Association with age and sex

2.3.1 Best model fit

We compared three model fits (linear, quadratic, and gamma) for predicting average fractional anisotropy (FAt) using age and sex as independent variables. Model performance was compared using AIC. For the PRONIA data, the quadratic model yielded the lowest AIC and was therefore selected.
This model accounted for 100 % of the cumulative AIC model weight across the three models.
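The objects `linear_fat`, `quadratic_fat`, and `gamma_fat` referenced in the next chunk are fitted earlier in the Rmd. One plausible specification, shown here as a sketch (the exact formulas are defined in the repository, not reproduced verbatim):

```r
# Linear and quadratic fits of average FAt on age and sex
linear_fat    <- lm(AverageFAt ~ age + sex, data = tbss_final_fat)
quadratic_fat <- lm(AverageFAt ~ age + I(age^2) + sex, data = tbss_final_fat)

# Gamma GLM (FAt is strictly positive, so a log link is a natural choice)
gamma_fat     <- glm(AverageFAt ~ age + sex,
                     family = Gamma(link = 'log'), data = tbss_final_fat)
```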

2.3.2 Code: best model fit

### Best model fit
# ------------------------------------------------------------------------------ 
# Test for the best model fit using AIC
# ------------------------------------------------------------------------------
# We defined and tested three models: linear, quadratic, and gamma.
# To compare models using AIC, we must convert the linear and quadratic models 
# into a more generic form (glm).
linear_fat_glm <- glm(linear_fat)
quadratic_fat_glm <- glm(quadratic_fat)

# Create a list of the models to be compared
model.set <- list(linear_fat_glm, quadratic_fat_glm, gamma_fat)
model.names <- c('linear', 'quadratic', 'gamma')

# Compare the models using AIC (lower AIC indicates better fit)
aictab_model_results_fat <- aictab(model.set, modnames = model.names)

# AIC interpretation:
# A lower AIC value indicates a better-fitting model.
# A model with a delta-AIC (difference in AIC values) greater than 2 
# is considered significantly better than the competing model.

# Prepare the dataset for model fitting (including relevant variables)
model_all <- tbss_final_fat %>%
  filter(Studygroup %in% c('ROP', 'CHR', 'HC'))%>%
  dplyr::select(PSN, AverageFAt, AverageFAt_predicted, AverageFAt_dev,
                age,sex, Studygroup)

2.3.3 Figure: Model fit

2.3.4 Code: Model fit

facet_label = c('HC' = 'healthy controls N=298',
                'CHR' = 'clinical high-risk N=183',
                'ROP' = 'recent-onset psychosis N=206')
model_all$Studygroup <- factor(model_all$Studygroup, levels = c('HC', 'CHR', 
                                                                'ROP'))
model_fit_data <- model_all %>%
  ggplot(aes(x = age, y = AverageFAt, color = Studygroup)) +
  stat_smooth(method = 'lm', formula = y ~ x + I(x^2), se = FALSE,
              show.legend = FALSE)+
  geom_point(show.legend = FALSE)+
  scale_color_manual(values = c('ROP' = '#e69c9c',
                                'CHR' = '#8ba6b1',
                                'HC' =  '#98c692'),
                     labels = c("CHR" = "clinical\nhigh-risk\nN=183",
                                "ROP" = "recent-onset\npsychosis\nN=206",
                                'HC' = 'healthy\ncontrols\nN=298'))+
  ylab('Average fractional anisotropy - tissue across\n25 ENIGMA regions of interest')+
  theme_bw()+
  theme(axis.title.y = element_text(size = 15),
        axis.text = element_text(size = 13))+
  facet_grid(sex~ Studygroup, labeller = labeller(Studygroup = facet_label))
model_fit_data

3 Group comparison

3.1 ANCOVA FAt ROIs

3.1.1 Result: ANCOVA

3.1.2 Code: ANCOVA

# ------------------------------------------------------------------------------
# ANCOVA Analysis for Three Groups (ROP, HC, and CHR)
# ------------------------------------------------------------------------------

# Prepare the data for ANCOVA analysis by selecting relevant columns:
# Studygroup (group variable), PSN, rois (regions of interest), age, and sex
anova_3groups_result <- tbss_final_fat %>%
  dplyr::select(Studygroup, PSN, rois, age, sex) %>%
  dplyr::select(-contains('Average'))%>%
  filter(Studygroup %in% c('ROP', 'HC', 'CHR')) %>%
  # Reshape the 'rois' (regions of interest) from wide to long format
  pivot_longer(cols = rois, 
               names_to = 'roi', values_to = 'value') %>%
  # Group by the 'roi' variable for performing ANCOVA per region
  group_nest(roi) %>%
  # Perform ANCOVA (Analysis of Covariance) for each region of interest
  mutate(anova_result = map(data, ~ Anova(aov(value ~ Studygroup + age +
                                              I(age^2) + sex, 
                                              data = .x), 
                                          type = 3, test.statistic = 'F'))) %>%
  # Tidy up the ANOVA results to make them easier to work with
  mutate(tidy_anova = map(anova_result, ~ tidy(.))) %>%
  # Unnest the tidied ANOVA results into a data frame for easy interpretation
  unnest(tidy_anova) %>%
  # For each term in the ANOVA results, adjust p-values for multiple comparisons using FDR correction
  group_by(term) %>%
  mutate(p_corr = p.adjust(p.value, method = 'fdr', n = length(p.value)),
         f_value = if_else(term == 'Studygroup', statistic, NA_real_)) %>%
  # Remove the intercept term from the results, as it's not relevant to our analysis
  filter(!str_detect(term, '(Intercept)'))%>%
  ungroup()%>%
  # For each ROI, calculate the total sum of squares (SS) across all terms
  group_by(roi)%>%
  mutate(ss_total = sum(sumsq))%>%
  ungroup()%>%
  # Calculate eta-squared for effect size for the 'Studygroup' term
  mutate(eta_square = sumsq/ss_total)%>%
  filter(term == 'Studygroup') %>%
  # Sort the results by the F-statistic in descending order (to see the most significant ROIs)
  arrange(-statistic) %>%
  dplyr::select(-c(data, anova_result, ss_total, sumsq))

# Filter the results to identify regions of interest (ROIs) with significant 
# effects (p-value < 0.05 after correction)
significant_rois_3groups <- anova_3groups_result %>%
  filter(p_corr < 0.05)

# Display the results in an interactive datatable with options to export to CSV, PDF, etc.
datatable(anova_3groups_result, extensions = 'Buttons', options = list(
  dom = 'Bfrtip',
  buttons = c('copy', 'csv', 'pdf', 'print'),
  pageLength = 25 # Change this to the number of rows you want to display
  ))

3.1.3 Post-Hoc Tests with the emmeans Package (ROIs)

In this analysis, ANCOVA was performed with ROI as the dependent variable and study group, age, sex, and age-squared as independent variables. The study group was the primary variable of interest. The emmeans package in R was used to compute estimated marginal means (EMMs), or least-squares means, for the study groups while adjusting for age, sex, and age-squared. This allows for post-hoc pairwise comparisons between study groups, providing insight into the differences after accounting for these covariates.

3.1.4 Code: Post-Hoc Tests with the emmeans Package (ROIs)


# Post-hoc results across 3 groups
result_posthoc <- tbss_final_fat %>%
  # Select relevant columns (Studygroup, PSN, rois, age, sex) for the analysis
  dplyr::select(Studygroup, PSN, rois, age, sex) %>%
  # Exclude columns related to 'Average'
  dplyr::select(-contains('Average'))%>%
  # Filter the dataset to include only the three study groups: ROP, HC, and CHR
  filter(Studygroup %in% c('ROP', 'HC', 'CHR')) %>%
  # Reshape data from wide to long format, with each ROI as a separate row
  pivot_longer(cols = rois, 
               names_to = 'roi', values_to = 'value') %>%
  # Group data by region of interest (ROI) for subsequent analysis
  group_by(roi) %>%
  do({
    # Perform ANCOVA with 'Studygroup', 'age', 'sex', and 'age^2' as predictors
    anova_result <- aov(value ~ Studygroup + age + I(age^2) + sex, data = .)
    # Conduct type 3 ANCOVA to assess significance of the predictors
    ANOVA_result <- Anova(anova_result, type = 3)
    # Check if the p-value for the Studygroup effect is significant (p < 0.05)
    if (ANOVA_result$`Pr(>F)`[2] < 0.05){
      significant_test <- 'yes'
    }else{
      significant_test <- 'no'
    }
    if (nrow(.) > 0 && significant_test == 'yes') {
      posthoc_result <- emmeans(anova_result, pairwise ~ Studygroup, adjust = 'fdr')
      # Extract contrasts from the post-hoc analysis results
      data.frame(posthoc_result$contrasts)
    } else {
      # If not significant, return an empty data frame
      data.frame()
    }
  }) %>%
  # Ungroup the data after the post-hoc analysis
  ungroup()

# Filter the ANOVA results to keep only significant regions (p_corr < 0.05)
sig_rois_anova <- anova_3groups_result %>% 
  filter(p_corr < 0.05)

# Extract relevant columns for the post-hoc results, and filter by significant ROIs
result_posthoc_pp <- result_posthoc %>%
  dplyr::select(roi, contrast, t.ratio, p.value)%>%
  filter(roi %in% sig_rois_anova$roi)

# Create an interactive data table to display the post-hoc results with options to copy, export as CSV/PDF, or print
datatable(result_posthoc_pp, extensions = 'Buttons', options = list(
  dom = 'Bfrtip',
  buttons = c('copy', 'csv', 'pdf', 'print'),
  pageLength = 25 # Change this to the number of rows you want to display
  ))

4 Canonical Correlation (CCA)

The canonical correlation is performed in ROP and CHR individuals only.

4.1 Canonical Correlation

4.1.1 Data preparation

# Data preparation for canonical correlation analysis (CCA)

# Filtering the dataset to include only the ROP and CHR study groups
rop_chr_vars <- tbss_final_fat %>%
  filter(Studygroup %in% c('ROP', 'CHR'))%>%
  # Renaming the variable 'UHR_GENETIC_1stdegree_00_Summary_Screening' for clarity
  rename(uhr_1stdegree = UHR_GENETIC_1stdegree_00_Summary_Screening)%>%
  # Creating a numeric variable for 'sex' (0 for male, 1 for female)
  mutate(sex_num = case_when(sex == 'male' ~ 0,
                             sex == 'female' ~ 1))%>%
  # Reversing the scales for variables where higher values indicate worse outcomes 
  # (e.g., negative scores should be positive)
  mutate(sips_p_t0_re = -sips_p_t0,
         sips_n_t0_re = -sips_n_t0,
         sips_g_t0_re = -sips_g_t0,
         sips_d_t0_re = -sips_d_t0,
         spia_sum_t0_re = -spia_sum_t0,
         bdi_t0_re = -bdi_t0,
         CPZE_cum_sum_re = -CPZE_cum_sum,
         exposome_score_re = -exposome_score)%>%
  mutate(uhr_1stdegree_re = case_when(uhr_1stdegree == 1 ~ 0,
                                   uhr_1stdegree == 0 ~ 1))%>%
  # Selecting relevant variables for the analysis
  dplyr::select(PSN, Studygroup, 
                GAF_DI_PastMonth_Screening, 
                GAF_DI_LifeTime_Screening,
                GAF_S_PastMonth_Screening, 
                GAF_S_LifeTime_Screening,
                sips_p_t0_re, 
                sips_n_t0_re, 
                sips_g_t0_re, 
                sips_d_t0_re, 
                spia_sum_t0_re,
                bdi_t0_re,
                CPZE_cum_sum_re, 
                exposome_score_re,
                uhr_1stdegree_re,
                wais_mr, 
                wais_v
                )%>%
  rowwise() %>%
  # Convert all variables to character type for consistency, ensuring no issues with factors
  mutate(across(everything(), ~ as.character(.)))%>%
  # Remove rows with all values being missing besides Studygroup and ID
  filter(!sum(is.na(c_across())) == ncol(.) - 2)%>%
  # Create a numeric variable for the study group (ROP = 0, CHR = 1)
  mutate(studygroup_num = case_when(Studygroup %in% 'ROP' ~ 0,
                                    Studygroup %in% 'CHR' ~ 1))%>%
  ungroup()%>%
  dplyr::select(-Studygroup)

# Impute missing values: 
# Numeric columns are imputed with the group-wise mean (except for 'PSN' and 'uhr_1stdegree_re')
rop_chr_vars_imputed <- rop_chr_vars %>%
  mutate(across(-c(PSN), ~ as.numeric(.)))%>%
  group_by(studygroup_num)%>%
  mutate(across(-(c(uhr_1stdegree_re, #sex_num,
                    PSN)), 
                ~ case_when(is.na(.) ~ mean(., na.rm = TRUE),
                            TRUE ~ .)),
         uhr_1stdegree_re = 
           case_when(is.na(uhr_1stdegree_re) ~ 
                       median(uhr_1stdegree_re, na.rm = TRUE),
                     TRUE ~ uhr_1stdegree_re))%>%
  ungroup()

# Prepare final dataset for CCA analysis
cca_vars_rop_chr <- rop_chr_vars_imputed %>%
  column_to_rownames('PSN')

# Prepare DTI data (Regions of Interest - ROIs) for further analysis
significant_rois_names_3groups <- significant_rois_3groups %>%
  # Append '_dev' to ROI names to select the deviation scores
  mutate(roi_names = paste0(roi, '_dev'))

# Filter and select relevant variables for DTI (based on significant ROIs)
rop_chr_dti <- tbss_final_fat %>%
  filter(Studygroup %in% c('ROP', 'CHR'))%>%
  dplyr::select(PSN, Studygroup, significant_rois_names_3groups$roi_names)%>%
  filter(PSN %in% rop_chr_vars$PSN)
cca_dti_rop_chr <- rop_chr_dti %>% 
  dplyr::select(-c(Studygroup, contains('Average')))%>%
  column_to_rownames('PSN')

# To write the dataframes out for cross-validation, we need to make 'PSN' a column again
# This is necessary because PSN is currently a row name, which we need to convert to a regular column 
cv_cca_dti_rop_chr <- cca_dti_rop_chr %>%
  rownames_to_column('PSN')
cv_cca_vars_rop_chr <- cca_vars_rop_chr %>%
  rownames_to_column('PSN')

# If you have adapted the script and want to look into the data, you can write
# out the data for cross-validation again. Otherwise, leave this part commented.
# write.csv(cv_cca_vars_rop_chr, 'simulated_data/cca_vars_rop_chr_for_cv.csv', row.names = FALSE)
# write.csv(cv_cca_dti_rop_chr, 'simulated_data/cca_dti_rop_chr_for_cv.csv', row.names = FALSE)

4.1.2 Result+Code: Nr (%) missing all

# Here, we plot, for each variable entering the CCA, the number and
# percentage of missing values.
rop_chr_vars %>%
  pivot_longer(cols = -c(PSN, studygroup_num), 
               names_to = 'variables', values_to = 'values')%>%
  mutate(missing = case_when(is.na(values) ~ 'missing',
                             !is.na(values) ~ 'no_missing'))%>%
  group_by(variables)%>%
  mutate(variables_count = n())%>%
  ungroup()%>%
  group_by(variables, missing)%>%
  mutate(missing_count = n())%>%
  ungroup()%>%
  distinct(variables, missing, missing_count, variables_count)%>%
  pivot_wider(names_from = missing, values_from = missing_count)%>%
  mutate(perc_missing = floor((missing/variables_count)*100))%>%
  ggplot(aes(x = variables, y = perc_missing))+
  geom_bar(stat = 'identity')+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+
  geom_text(aes(label = paste0(missing, sep = '')), 
            vjust = -2.5, colour = '#1B1919FF', size = 3)+
  geom_text(aes(label = paste0('(',perc_missing, '%)', sep = '')),
            vjust = -1, colour = '#1B1919B2', size = 3)+
  ylab('Percent % of missing')+
  ylim(c(0, 35))

4.1.3 Result+Code: Nr (%) missing per group

# Here, we plot, for each variable entering the CCA, the number and
# percentage of missing values per study group.
rop_chr_vars %>%
  pivot_longer(cols = -c(PSN, studygroup_num), 
               names_to = 'variables', values_to = 'values')%>%
  mutate(missing = case_when(is.na(values) ~ 'missing',
                             !is.na(values) ~ 'no_missing'))%>%
  group_by(studygroup_num, variables)%>%
  mutate(variables_count = n())%>%
  ungroup()%>%
  group_by(studygroup_num, variables, missing)%>%
  mutate(missing_count = n())%>%
  ungroup()%>%
  distinct(studygroup_num, variables, missing, missing_count, variables_count)%>%
  pivot_wider(names_from = missing, values_from = missing_count)%>%
  mutate(perc_missing = floor((missing/variables_count)*100))%>%
  ggplot(aes(x = variables, y = perc_missing))+
  geom_bar(stat = 'identity')+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+
  geom_text(aes(label = paste0(missing, sep = '')), 
            vjust = -2.5, colour = '#1B1919FF', size = 3)+
  geom_text(aes(label = paste0('(',perc_missing, '%)', sep = '')),
            vjust = -1, colour = '#1B1919B2', size = 3)+
  ylab('Percent % of missing')+
  ylim(c(0, 55))+
  facet_wrap(.~ studygroup_num, ncol = 1)

4.1.4 Result+Code: Permutation analysis

In a recent review, Winkler et al. (2020) described a permutation procedure for canonical correlation analysis (CCA) as "a random shuffling of the rows of the two variable sets". They argue that simple permutation tests for CCA lead to excessive error rates for all canonical correlations except the first, as such tests fail to account for the variance explained by previous canonical variables. They propose a solution that estimates the canonical correlations stepwise, removing explained variance at each iteration while accommodating different numbers of variables on each side. While Winkler et al. (2020) provided MATLAB code, we have translated it into R (see functions on GitHub).
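For intuition, here is a minimal sketch of the *simple* permutation test, which is valid for the first canonical correlation only (the stepwise procedure of Winkler et al. (2020) used in this report additionally removes previously explained variance at each step). `X` and `Y` stand for the two variable sets (here, the clinical variables and the DTI deviation scores); this is an illustrative simplification, not the analysis function itself:

```r
simple_perm_cca <- function(X, Y, n_perm = 1000, seed = 1) {
  set.seed(seed)
  # observed first canonical correlation
  r_obs <- cancor(X, Y)$cor[1]
  # null distribution: shuffle the rows of one variable set
  r_perm <- replicate(n_perm, {
    Y_shuf <- Y[sample(nrow(Y)), , drop = FALSE]
    cancor(X, Y_shuf)$cor[1]
  })
  # permutation p-value, counting the observed value in the null
  (1 + sum(r_perm >= r_obs)) / (1 + n_perm)
}
```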

##       canonical_variates_rc pfwer_rc
##  [1,]           0.450737603    0.001
##  [2,]           0.398167886    0.011
##  [3,]           0.350193153    0.163
##  [4,]           0.338617262    0.487
##  [5,]           0.289269654    0.869
##  [6,]           0.248706862    0.970
##  [7,]           0.233015761    0.988
##  [8,]           0.224151348    0.996
##  [9,]           0.208299480    1.000
## [10,]           0.153651873    1.000
## [11,]           0.125298104    1.000
## [12,]           0.111867690    1.000
## [13,]           0.077222910    1.000
## [14,]           0.047264853    1.000
## [15,]           0.038116001    1.000
## [16,]           0.009202222    1.000

4.1.5 Result+Code: Bootstrap for feature importance

After identifying significant canonical correlations based on family-wise error corrected p-values, we performed bootstrapping using 1000 subsamples to assess which variables contributed significantly to the overall correlation across variable pairs. This procedure was based on the method described by Winkler et al. (2020), which involves stepwise estimation of canonical correlations while removing previously explained variance at each iteration. We adapted the R package Data4PCCAr (https://github.com/HerveAbdi/data4PCCAR/blob/master/R/Boot4CCA.R) to implement this approach, ensuring it aligns with the specific steps outlined by Winkler et al. (2020).

boot_rop_chr <- bootstrap_winkler(as.matrix(cca_vars_rop_chr),
                                  as.matrix(cca_dti_rop_chr),
                                  nf2keep = ncol(cca_vars_rop_chr),
                                  nIter = 1000,
                                  critical.value = 2)
boot_rop_chr$bootRatiosSignificant.i %>%
  as.data.frame()%>%
  filter(`Dimension 1` == TRUE|
           `Dimension 2` == TRUE)%>%
  dplyr::select(c(`Dimension 1`, `Dimension 2`))
##                            Dimension 1 Dimension 2
## GAF_DI_PastMonth_Screening        TRUE       FALSE
## GAF_DI_LifeTime_Screening         TRUE       FALSE
## GAF_S_PastMonth_Screening         TRUE       FALSE
## GAF_S_LifeTime_Screening          TRUE       FALSE
## sips_p_t0_re                     FALSE        TRUE
## sips_n_t0_re                     FALSE        TRUE
## sips_d_t0_re                      TRUE        TRUE
## spia_sum_t0_re                    TRUE       FALSE
## uhr_1stdegree_re                 FALSE        TRUE
## wais_mr                           TRUE       FALSE
## wais_v                            TRUE        TRUE
## studygroup_num                    TRUE        TRUE
boot_rop_chr$bootRatiosSignificant.j %>%
  as.data.frame()%>%
  filter(`Dimension 1` == TRUE|
           `Dimension 2` == TRUE)%>%
  dplyr::select(c(`Dimension 1`, `Dimension 2`))
##           Dimension 1 Dimension 2
## FX_ST_dev        TRUE       FALSE
## EC_dev           TRUE       FALSE
## CGC_dev          TRUE       FALSE
## CGH_dev          TRUE       FALSE
## FX_dev          FALSE        TRUE
## GCC_dev          TRUE        TRUE
## PTR_dev          TRUE       FALSE
## RLIC_dev         TRUE       FALSE
## SLF_dev          TRUE       FALSE
## ACR_dev          TRUE       FALSE
## SCP_dev          TRUE       FALSE
## PCR_dev          TRUE       FALSE
## SS_dev           TRUE       FALSE
## SCC_dev         FALSE        TRUE

4.1.6 Code: Extract the loadings and define color-code

# 1. Extract the bootstrapped loadings for each iteration and create data frames for plotting.
# These bootstrapped loadings will be used for boxplots, while the individual loadings
# derived from the canonical correlation analysis (CCA) function will serve as the basis for the color coding.

# Process bootstrapped loadings for Dimension 1 and Dimension 2 (first for variables and then for DTI):
# - Extract weights for Dimension 1 (bootstrapped) and reshape to long format.
# - Extract weights for Dimension 2 (bootstrapped) and reshape to long format.

# Step 1a: Extract bootstrapped weights for Dimension 1 (Variables)
loadings_vars_dim1_boot <- boot_rop_chr$bootWeights.i %>%
  as.data.frame()%>%
  dplyr::select(starts_with(c('Dimension 1.')))%>%
  rownames_to_column('features')%>%
  pivot_longer(-features, names_to = 'iteration', values_to = 'loadings_boot')%>%
  dplyr::select(-iteration)%>%
  mutate(dimension = 'V1')

# Step 1b: Extract bootstrapped weights for Dimension 2 (Variables)
loadings_vars_dim2_boot <- boot_rop_chr$bootWeights.i %>%
  as.data.frame()%>%
  dplyr::select(starts_with(c('Dimension 2.')))%>%
  rownames_to_column('features')%>%
  pivot_longer(-features, names_to = 'iteration', values_to = 'loadings_boot')%>%
  dplyr::select(-iteration)%>%
  mutate(dimension = 'V2')

# Step 1c: Combine the data for both dimensions (Dimension 1 and Dimension 2) into a single data frame.
loadings_vars_boot <- loadings_vars_dim1_boot %>%
  rbind(., loadings_vars_dim2_boot)

# Step 2a: Extract bootstrapped weights for Dimension 1 (DTI)
loadings_dti_dim1_boot <- boot_rop_chr$bootWeights.j %>%
  as.data.frame()%>%
  dplyr::select(starts_with(c('Dimension 1.')))%>%
  rownames_to_column('features')%>%
  pivot_longer(-features, names_to = 'iteration', values_to = 'loadings_boot')%>%
  dplyr::select(-iteration)%>%
  mutate(dimension = 'V1')

# Step 2b: Extract bootstrapped weights for Dimension 2 (DTI)
loadings_dti_dim2_boot <- boot_rop_chr$bootWeights.j %>%
  as.data.frame()%>%
  dplyr::select(starts_with(c('Dimension 2.')))%>%
  rownames_to_column('features')%>%
  pivot_longer(-features, names_to = 'iteration', values_to = 'loadings_boot')%>%
  dplyr::select(-iteration)%>%
  mutate(dimension = 'V2')

# Step 2c: Combine the data for both dimensions (Dimension 1 and Dimension 2) into a single data frame.
loadings_dti_boot <- loadings_dti_dim1_boot %>%
  rbind(., loadings_dti_dim2_boot)

# Step 3: Calculate the loadings using the CCA package.
x <- CCA::cc(cca_vars_rop_chr,cca_dti_rop_chr)
# Step 3a: Check if the CCA package results match the results from the Winkler 
# function to ensure consistency.
# Compare the canonical correlations to ensure that both methods give the same results.
if (sum(round(x$cor, digits = 8) - round(canonical_variates_rc, digits = 8)) != 0){
  print('CCA package did not provide the same canonical variates as the Winkler function! STOP')
}else{
  print('CCA package and Winkler-function provide same results')
}
## [1] "CCA package and Winkler-function provide same results"
# Step 3b: Extract the loadings from the CCA results for the variables and the DTI.
# These loadings will be used for creating the figures and color-coding in the final plots.
loadings_CCA_vars_rop_chr<-cor(cca_vars_rop_chr,x$scores$xscores,use="complete.obs") 
loadings_CCA_dti_rop_chr <-cor(cca_dti_rop_chr,x$scores$yscores,use="complete.obs")

# Step 4: Create the two data frames (vars_sig and dti_sig) for color-bar plotting.
# These will be used to create the color bars based on the bootstrapped results.
vars_sig <- boot_rop_chr$bootRatiosSignificant.i %>%
  as.data.frame()%>%
  dplyr::select(c(`Dimension 1`, `Dimension 2`))%>%
  rownames_to_column('features')%>%
  mutate(V1 = case_when(`Dimension 1` == TRUE ~ '1',
                                     `Dimension 1` == FALSE ~ '0'),
         V2 = case_when(`Dimension 2` == TRUE ~ '1',
                                     `Dimension 2` == FALSE ~ '0'))%>%
  dplyr::select(features, starts_with('V'))%>%
  pivot_longer(cols = starts_with('V'), names_to = 'dimension', 
                                  values_to = 'significance')

loadings_overall_vars <- loadings_CCA_vars_rop_chr%>%
  as.data.frame()%>%
  rownames_to_column('features')%>%
  dplyr::select(features, V1, V2)%>%
  pivot_longer(cols = -features, names_to = 'dimension', values_to = 'overall_loadings')%>%
  mutate(loadings_color = round(overall_loadings, digits = 3))

loadings_vars_plot <- loadings_vars_boot %>%
  as.data.frame()%>%
  left_join(., vars_sig, by = c('dimension', 'features'))%>%
  left_join(., loadings_overall_vars, by = c('dimension', 'features'))%>%
  mutate(domain = 'clinical')%>%
  mutate(loadings_color = as.character(loadings_color))%>%
  mutate(colorbar = case_when(significance == '0' ~ '0',
                              TRUE ~ loadings_color))

dti_sig <- boot_rop_chr$bootRatiosSignificant.j %>%
  as.data.frame()%>%
  dplyr::select(c(`Dimension 1`, `Dimension 2`))%>%
  rownames_to_column('features')%>%
  mutate(V1 = case_when(`Dimension 1` == TRUE ~ '1',
                                     `Dimension 1` == FALSE ~ '0'),
         V2 = case_when(`Dimension 2` == TRUE ~ '1',
                                     `Dimension 2` == FALSE ~ '0'))%>%
  dplyr::select(features, starts_with('V'))%>%
  pivot_longer(cols = starts_with('V'), names_to = 'dimension',
               values_to = 'significance')

loadings_overall_dti <- loadings_CCA_dti_rop_chr%>%
  as.data.frame()%>%
  rownames_to_column('features')%>%
  dplyr::select(features, V1, V2)%>%
  pivot_longer(cols = -features, 
               names_to = 'dimension', values_to = 'overall_loadings')%>%
  mutate(loadings_color = round(overall_loadings, digits = 3))

loadings_dti_plot <- loadings_dti_boot %>%
  as.data.frame()%>%
  left_join(., dti_sig, by = c('dimension', 'features'))%>%
  left_join(., loadings_overall_dti, by = c('dimension', 'features'))%>%
  mutate(loadings_color = as.character(round(overall_loadings, digits = 3)))%>%
  mutate(colorbar = case_when(significance == '0' ~ '0',
                              TRUE ~ loadings_color))%>%
  mutate(domain = 'dwi')

loadings_all <- loadings_vars_plot %>%
  rbind(., loadings_dti_plot %>% dplyr::select(names(loadings_vars_plot)))

overall_colors <- loadings_dti_plot %>%
  rbind(., loadings_vars_plot)%>%
  filter(significance %in% 1)%>%
  distinct(colorbar, .keep_all = TRUE)%>%
  arrange(colorbar)

# Define the colors for loadings (not adapted for simulated data):
color_values <- 
  c('-0.734' = "#6B3D3D", '-0.651'="#734343",  '0.618' = "#7C4A4A", 
    '-0.599' = "#855151", '-0.59' ="#8E5858",  '-0.488'= "#965E5E", 
    '-0.48'  = "#9F6565", '0.48'  ="#9F6565",  '-0.476'= "#A86C6C",  
    '-0.472' = "#B17373", '-0.442'="#BA7A7A",  '0.415' = "#C28080",  
    '-0.399' = "#CB8787", '0.381'= "#D48E8E",  '0.366' = "#DD9595",
    '0.365'  = "#E69C9C", '0.363'= "#E7A1A1", '-0.348' = "#E8A7A7",
    '-0.338' = "#EAACAC", '-0.321'="#EBB2B2", '-0.311' = "#ECB7B7",
    '-0.31'  = "#EEBDBD", '-0.303'="#EFC2C2", '0.295'  = "#F0C8C8",
    '-0.29'  = "#F2CDCD", '0.283' ="#F3D3D3", '-0.282' = "#F4D8D8",
    '-0.28'  = "#F6DEDE", '-0.273'='#F7E3E3', '0.261'  = "#F7E3E3", 
    '0' = 'lightgrey')
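Because the mapping above is hard-coded to the loadings of the original PRONIA data, it will not match the simulated data. One alternative (a sketch, not part of the original pipeline, shown here with hypothetical loading keys) is to derive the gradient programmatically from the observed loadings:

```r
# Sketch (hypothetical keys; in practice e.g. unique(overall_colors$colorbar)):
# generate the red gradient so it adapts to the loadings actually observed.
loading_keys  <- sort(c('-0.734', '-0.651', '0.618'))
gradient_cols <- colorRampPalette(c("#6B3D3D", "#F7E3E3"))(length(loading_keys))
color_values  <- c(setNames(gradient_cols, loading_keys), '0' = 'lightgrey')
```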

4.1.7 Figure: Plot loadings (CCA)

4.1.8 Code: Plot loadings (CCA)

### Plotting the loading of canonical variables per dimension:
dimension_label <- c('V1' = 'First mode',
                     'V2' = 'Second mode')
domain_label    <- c('clinical' = 'behavioral variables',
                     'dwi' = 'white-matter microstructure')
plot_loadings_cca <- loadings_all %>%
  mutate(colorbar = case_when(
    significance == '0' ~ '0',
    TRUE ~ as.character(round(as.numeric(loadings_color), 3))
  ))%>%
  ggplot(aes(x = loadings_boot, y = features, color = colorbar))+
  geom_boxplot(outlier.shape = NA, width = 0.5, lwd = 1)+
  theme_bw(base_size = 15)+
  scale_y_discrete(labels = c('wais_v' = 'WAIS: vocabulary',
                              'wais_mr'= 'WAIS: matrices',
                              'uhr_1stdegree_re'= 'familial risk: >= 1st degree',
                              'spia_sum_t0_re' = 'SPI-A symptoms',
                              'sips_p_t0_re' = 'SIPS: positive',
                              'sips_n_t0_re' = 'SIPS: negative',
                              'sips_g_t0_re' = 'SIPS: general',
                              'sips_d_t0_re' = 'SIPS: disorganized',
                              'GAF_S_PastMonth_Screening' = 'GAF-Symptoms: Past Month',
                              'GAF_DI_PastMonth_Screening' = 'GAF-DI: Past Month',
                              'GAF_S_LifeTime_Screening' = 'GAF-Symptoms: Lifetime',
                              'GAF_DI_LifeTime_Screening' = 'GAF-DI: Lifetime',
                              'exposome_score_re' = 'Aggregated environmental risk',
                              'CPZE_cum_sum_re' = 'CPZE: cumulative sum',
                              'bdi_t0_re' = 'BDI-II',
                              'SS_dev' = 'SS',
                              'SLF_dev' = 'SLF',
                              'SCP_dev' = 'SCP',
                              'EC_dev' = 'EC',
                              'FX_ST_dev' = 'FXST',
                              'CGC_dev' = 'CGC',
                              'FX_dev' = 'FX',
                              'GCC_dev' = 'GCC',
                              'PTR_dev' = 'PTR',
                              'ALIC_dev' = 'ALIC',
                              'CGH_dev' = 'CGH',
                              'CP_dev' = 'CP',
                              'RLIC_dev' = 'RLIC',
                              'ACR_dev' = 'ACR',
                              'PLIC_dev' = 'PLIC',
                              'PCR_dev' = 'PCR',
                              'SCC_dev' = 'SCC',
                              'ML_dev' = 'ML',
                              'studygroup_num' = 'Studygroup'
  ))+
  xlim(c(-0.75, 0.65))+
  facet_grid(domain~dimension, scales = 'free_y',
             labeller = labeller(domain = domain_label,
                                 dimension = dimension_label))+
  theme(axis.text = element_text(size = 17),
        legend.position = 'none',
        panel.spacing.x = unit(1, "lines"),  # Horizontal spacing between panels
        panel.spacing.y = unit(1, "lines"),
        axis.title.y = element_blank(),
        strip.text = element_text(face = "bold"),
        #strip.text = element_blank(),
        strip.placement = "outside",
        strip.background = element_rect(color = "white", fill = "white"),
        strip.text.x = element_text(vjust = 2, size = 20),
        strip.text.y = element_text(size = 20))+
  scale_color_manual(values = color_values)+
  xlab('canonical loadings')+
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey")  # Vertical line at x = 0

plot_loadings_cca

4.1.9 Extract canonical loadings:

### Extract canonical loadings:  
# The following code extracts the canonical loadings for the clinical and DTI (diffusion tensor imaging) variables 
# from bootstrapped projections and combines them into one data frame.

# Extract clinical projections and rename columns with a 'clin_' prefix for easy identification later.
clin_projections <- boot_rop_chr$clin_proj %>%
  as.data.frame(.)%>%
  rename_with(~ paste0("clin_", .), everything())

# Extract DTI projections and rename columns with a 'dwi_' prefix for easy identification later.
dwi_projections <- boot_rop_chr$dwi_proj %>%
  as.data.frame(.)%>%
  rename_with(~ paste0("dwi_", .), everything())

# Combine clinical and DTI projections into a single data frame.
projections <- clin_projections %>%
  cbind(., dwi_projections)

# Combine projections with canonical correlation analysis (CCA) variables.
# First, convert row names to a column for identification ('PSN').
cca_projections <- cca_vars_rop_chr %>%
  rownames_to_column('PSN')%>%
  left_join(., cca_dti_rop_chr %>% rownames_to_column('PSN'), by ='PSN')%>%
  cbind(., projections)

4.1.10 Figure: canonical loadings


4.1.11 Code: plot canonical loadings

Here, as an example, we plot the loadings of the first mode: clinical variables vs. DWI. The same function can be applied to each clinical variable vs. DWI loading, as presented in the Supplementary material.

# Function to plot the canonical loadings per group and combined
# Plots the first mode of the canonical loadings
# Note: 'saving_label' is accepted for optional figure export but unused below
plot_loadings_individually <- function(projection_df, x_var, y_var,
                                       x_label, y_label,
                                       saving_label){
  # Define custom facet labels for the study groups
  facet_labels_studygroup <- c('0' = 'recent-onset psychosis',
                               '1' = 'clinical high-risk')
  
  # Plot the individual group-wise scatter plots
  # Facet the plot by studygroup_num to show separate plots for different groups
  plot_group <- projection_df %>%
    ggplot(aes(x = !!sym(x_var), y = !!sym(y_var)))+
    geom_smooth(method = 'lm', color = 'black')+
    geom_point(alpha = 0.3)+
    facet_wrap(.~studygroup_num, 
               labeller = labeller(studygroup_num = facet_labels_studygroup),
               ncol = 1)+
    theme_bw(base_size = 15)+
    xlab(x_label)+
    ylab(y_label)
  
  # Plot the combined plot for both groups in a single graph
  # Display scatter plot with group coloring and regression line
  plot_all <- projection_df %>%
    ggplot(aes(x = !!sym(x_var), y = !!sym(y_var)))+
    geom_smooth(method = 'lm', color = 'black')+
    geom_point(alpha = 0.6, aes(color = as.factor(studygroup_num)))+
    theme_bw(base_size = 15)+
    xlab(x_label)+
    ylab(y_label)+
    scale_color_manual(values = c('0' = '#e69c9c',
                                  '1' = '#8ba6b1'),
                       labels = c('0' = 'recent-onset psychosis',
                                  '1' = 'clinical high-risk'))+
    theme(
      legend.title = element_blank(),
      legend.position = "bottom",            # Move legend below the plot
      legend.direction = "horizontal",       # Horizontal legend layout
      legend.text = element_text(size = 11), # Adjust text size
      legend.key.height = unit(0.1, "cm"),   # Reduce legend key height
      legend.spacing.x = unit(0.01, "cm")    # Reduce spacing between legend items
    ) +
    guides(
      color = guide_legend(nrow = 2, byrow = TRUE) # Arrange legend in two rows
    )

  # Combine the two plots (group-wise and combined) side-by-side
  plot_df <- plot_grid(plot_all, plot_group, ncol = 2)  # ncol = 2 means side-by-side
  plot_df
}

# Call the function to plot the first component's loadings
plot_loadings_individually(cca_projections, x_var = 'clin_V1', y_var = 'dwi_V1', 
                           x_label = 'loadings mode 1: clinical',
                           y_label = 'loadings mode 1:\nwhite-matter microstructure',
                           saving_label = 'plot_loading1')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

5 Cohen’s d of group effects

Here, we show the group effects calculated using the emmeans package, adjusted for age and sex.
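Because the full pipeline below is commented out, a minimal runnable sketch of the same emmeans pattern may help; it uses the built-in mtcars data as a stand-in (with `cyl` playing the role of Studygroup and `wt` a covariate), not the study data:

```r
# Minimal sketch of the emmeans-based adjusted effect size (stand-in data).
library(emmeans)

dat <- mtcars
dat$cyl <- factor(dat$cyl)

fit <- aov(mpg ~ cyl + wt, data = dat)                   # covariate-adjusted model
posthoc <- emmeans(fit, pairwise ~ cyl, adjust = 'fdr')  # FDR-corrected contrasts
eff_size(posthoc, sigma = sigma(fit), edf = df.residual(fit))  # adjusted Cohen's d
```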

5.1 Adjusted Cohen’s d

5.1.1 Figure: group comparison adjusted Cohen’s d

5.1.2 Code: group comparison adjusted Cohen’s d

Since the leave-one-site-out jackknife analysis takes a long time to run, the code is commented out here to speed up knitting.
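The confidence interval itself reduces to a simple computation over the leave-one-site-out effect sizes. A self-contained sketch with hypothetical Cohen's d values (the real code derives one such vector per ROI and contrast):

```r
# Sketch of the jackknife 95% CI: effect size recomputed with each site left
# out, then a normal-approximation interval around the mean (hypothetical d's).
d_leave_one_out <- c(0.31, 0.28, 0.35, 0.30, 0.27)
mean_d <- mean(d_leave_one_out)
se_d   <- sd(d_leave_one_out) / sqrt(length(d_leave_one_out))
round(c(mean  = mean_d,
        lower = mean_d - 1.96 * se_d,
        upper = mean_d + 1.96 * se_d), 3)
# → mean 0.302, 95% CI [0.275, 0.329]
```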

# # Here, we create the plot for FAt values per ROI represented as Cohen's d
# 
# # First, process p-values for the post hoc tests by creating 'group1' and 'group2' based on the contrast 
# # and selecting relevant columns (roi, group1, group2, and p.value)
# p_values <- result_posthoc_pp %>%
#   mutate(group1 = case_when(contrast %in% 'CHR - ROP' ~ 'CHR',
#                             contrast %in% 'CHR - HC' ~ 'CHR',
#                             contrast %in% 'HC - ROP' ~ 'HC'),
#          group2 = case_when(contrast %in% 'CHR - ROP' ~ 'ROP',
#                             contrast %in% 'CHR - HC' ~ 'HC',
#                             contrast %in% 'HC - ROP' ~ 'ROP'))%>%
#   dplyr::select(roi, group1, group2, p.value)
# 
# # For effect size calculation, perform a similar process, but without the significance filtering step.
# # Here, we calculate Cohen's d for all ROIs regardless of statistical significance
# effect_size <- tbss_final_fat %>%
#   dplyr::select(Studygroup, PSN, rois, age, sex) %>%
#   dplyr::select(-contains('Average'))%>%
#   filter(Studygroup %in% c('ROP', 'HC', 'CHR')) %>%
#   pivot_longer(cols = rois, 
#                names_to = 'roi', values_to = 'value') %>%
#   group_by(roi) %>%
#   do({
#     anova_result <- aov(value ~ Studygroup + age + I(age^2) + sex, data = .)
#     # look further into the different types of ANCOVA
#     ANOVA_result <- Anova(anova_result, type = 3)
#     if (ANOVA_result$`Pr(>F)`[2] < 0.05){
#       significant_test <- 'yes'
#     }else{
#       significant_test <- 'no'
#     }
#     if (nrow(.) > 0 && significant_test == 'yes') {
#       posthoc_result <- emmeans(anova_result, pairwise ~ Studygroup, adjust = 'fdr')
#       effect_size <- eff_size(posthoc_result, sigma = sigma(anova_result), 
#                               edf = df.residual(anova_result))
#       data.frame(effect_size)
#     } else {
#       data.frame()
#     }
#   }) %>%
#   ungroup()%>%
#   mutate(group1 = case_when(contrast %in% '(CHR - ROP)' ~ 'CHR',
#                             contrast %in% '(CHR - HC)' ~ 'CHR',
#                             contrast %in% '(HC - ROP)' ~ 'HC'),
#          group2 = case_when(contrast %in% '(CHR - ROP)' ~ 'ROP',
#                             contrast %in% '(CHR - HC)' ~ 'HC',
#                             contrast %in% '(HC - ROP)' ~ 'ROP'))%>%
#   dplyr::select(roi, group1, group2, effect.size)%>%
#   left_join(., p_values, by = c('group1', 'group2', 'roi'))
# 
# # Define which effect sizes refer to significant rois.
# effect_size_corrected_all <- tbss_final_fat %>%
#   dplyr::select(Studygroup, PSN, rois, age, sex) %>%
#   dplyr::select(-contains('Average'))%>%
#   filter(Studygroup %in% c('ROP', 'HC', 'CHR')) %>%
#   pivot_longer(cols = rois, 
#                names_to = 'roi', values_to = 'value') %>%
#   group_by(roi) %>%
#   do({
#     anova_result <- aov(value ~ Studygroup + age + I(age^2) + sex, data = .)
#     ANOVA_result <- Anova(anova_result, type = 3)
#     posthoc_result <- emmeans(anova_result, pairwise ~ Studygroup, adjust = 'fdr')
#     effect_size <- eff_size(posthoc_result, sigma = sigma(anova_result), 
#                               edf = df.residual(anova_result))
#       data.frame(effect_size)
#       })%>%
#   ungroup()%>%
#   mutate(group1 = case_when(contrast %in% '(CHR - ROP)' ~ 'CHR',
#                             contrast %in% '(CHR - HC)' ~ 'CHR',
#                             contrast %in% '(HC - ROP)' ~ 'HC'),
#          group2 = case_when(contrast %in% '(CHR - ROP)' ~ 'ROP',
#                             contrast %in% '(CHR - HC)' ~ 'HC',
#                             contrast %in% '(HC - ROP)' ~ 'ROP'))%>%
#   dplyr::select(roi, group1, group2, effect.size)%>%
#   left_join(., p_values, by = c('group1', 'group2', 'roi'))
# 
# # Function to perform jackknife analysis and calculate 95% CI for Cohen's d
# jackknife_analysis <- function(df, contrast_predict, p_value_df){
#   
#   # Get unique sites
#   unique_sites <- unique(df$ArtificialSite)
#   
#   # Initialize a list to store results
#   results <- list()
#   
#   # Loop over each site to perform jackknife analysis
#   for (site in unique_sites) {
#     
#     # Filter out the current site
#     df_filtered <- df %>% filter(ArtificialSite != site)
#     
#     # Calculate Cohen's d
#     #cohens_d_result <- calculate_cohensd(df_filtered, groups, p_value_df)
#     effect_size_result <- df_filtered %>%
#       dplyr::select(Studygroup, PSN, rois, age, sex) %>%
#       dplyr::select(-contains('Average'))%>%
#       filter(Studygroup %in% c('ROP', 'HC', 'CHR')) %>%
#       pivot_longer(cols = rois, 
#                names_to = 'roi', values_to = 'value') %>%
#       group_by(roi) %>%
#       do({
#         anova_result <- aov(value ~ Studygroup + age + I(age^2) + sex, data = .)
#         # look further into the different types of ANCOVA
#         ANOVA_result <- Anova(anova_result, type = 3)
#         if (ANOVA_result$`Pr(>F)`[2] < 0.05){
#           significant_test <- 'yes'
#           }else{
#             significant_test <- 'no'
#           }
#         posthoc_result <- emmeans(anova_result, pairwise ~ Studygroup, adjust = 'fdr')
#         effect_size <- eff_size(posthoc_result, sigma = sigma(anova_result), edf = df.residual(anova_result))
#         data.frame(effect_size)
#         }) %>%
#       ungroup()%>%
#       filter(contrast %in% contrast_predict)%>%
#       mutate(group1 = case_when(contrast %in% '(CHR - ROP)' ~ 'CHR',
#                                 contrast %in% '(CHR - HC)' ~ 'CHR',
#                                 contrast %in% '(HC - ROP)' ~ 'HC'),
#              group2 = case_when(contrast %in% '(CHR - ROP)' ~ 'ROP',
#                                 contrast %in% '(CHR - HC)' ~ 'HC',
#                                 contrast %in% '(HC - ROP)' ~ 'ROP'))%>%
#       dplyr::select(roi, group1, group2, effect.size)%>%
#       left_join(., p_value_df, by = c('group1', 'group2', 'roi'))
#     # Store the result in the list, using the site as the key
#     results[[site]] <- effect_size_result
#   }
#   
#   # Initialize a list to store CIs
#   ci_list <- list()
#   
#   # Loop over each ROI (assuming ROI names are consistent across sites)
#   rois_loop <- unique(results[[1]]$roi)
#   for (roi_name in rois_loop) {
#     
#     # Extract Cohen's d values for the current ROI across all sites
#     cohens_d_values <- sapply(results, function(res) {
#       res %>% filter(roi == roi_name) %>% pull(effect.size)
#     })
#     
#     # Calculate mean Cohen's d
#     mean_d <- mean(cohens_d_values)
#     
#     # Calculate standard error
#     se_d <- sd(cohens_d_values) / sqrt(length(cohens_d_values))
#     
#     # Calculate 95% confidence interval
#     ci_lower <- mean_d - 1.96 * se_d
#     ci_upper <- mean_d + 1.96 * se_d
#     
#     # Store the results in a data frame
#     ci_list[[roi_name]] <- data.frame(
#       roi = roi_name,
#       mean_cohens_d = mean_d,
#       ci_lower = ci_lower,
#       ci_upper = ci_upper
#     )
#   }
#   
#   # Combine all ROIs into one data frame
#   ci_df <- do.call(rbind, ci_list)
#   
#   return(ci_df)
# }
# 
# # Apply the jackknife analysis with CI calculation for each comparison
# ci_rop_hc <- jackknife_analysis(tbss_final_fat, '(HC - ROP)', p_values)%>%
#   mutate(contrast = 'HC_ROP')
# ci_rop_chr <- jackknife_analysis(tbss_final_fat, c('(CHR - ROP)'), p_values)%>%
#   mutate(contrast = 'CHR_ROP')
# ci_hc_chr <- jackknife_analysis(tbss_final_fat, c('(CHR - HC)'), p_values)%>%
#   mutate(contrast = 'CHR_HC')
# 
# ci_groups <- ci_rop_hc %>%
#   rbind(., ci_rop_chr)%>%
#   rbind(., ci_hc_chr)
# 
# groups_effsize <- effect_size_corrected_all %>%
#   filter(!is.na(p.value))%>%
#   dplyr::select(group1, group2, effect.size, roi, p.value)%>%
#   unite('contrast', c(group1, group2), sep = '_')%>%
#   mutate(significance = case_when(p.value < 0.05 ~ 'sig',
#                                   p.value >= 0.05 ~ 'non'))%>%
#   mutate(effect.size = case_when(contrast %in% 'CHR_HC' ~ effect.size*-1,
#                              TRUE ~ effect.size))%>%
#   unite('groups_sig', c(contrast, significance), sep = '_', remove = FALSE)%>%
#   left_join(., ci_groups, by = c('contrast', 'roi'))%>%
#   mutate(ci_lower = case_when(contrast %in% 'CHR_HC' ~ ci_lower*-1,
#                               TRUE ~ ci_lower),
#          ci_upper = case_when(contrast %in% 'CHR_HC' ~ ci_upper*-1,
#                               TRUE ~ ci_upper))
# 
# labels_rois <- c(
#   'ACR' = 'anterior corona\nradiata (ACR)',
#   'ALIC' = 'anterior limb of\ninternal capsule (ALIC)',
#   'BCC' = 'body of\ncorpus callosum (BCC)',
#   'CGC' = 'cingulum (CGC)',
#   'CGH' = 'cingulum\n(hippocampal portion) (CGH)',
#   'CP' = 'cerebral peduncle (CP)',
#   'CST' = 'corticospinal\ntract (CST)',
#   'EC' = 'external capsule (EC)',
#   'FX' = 'fornix (FX)',
#   'FX_ST' = 'fornix stria\nterminalis (FXST)',
#   'GCC' = 'genu of\ncorpus callosum (GCC)',
#   'ICP' = 'inferior\ncerebellar peduncle (ICP)',
#   'IFO' = 'inferior fronto\noccipital fasciculus (IFO)',
#   'ML' = 'medial lemniscus (ML)',
#   'PCR' = 'posterior\ncorona radiata (PCR)',
#   'PLIC' = 'posterior limb of\ninternal capsule (PLIC)',
#   'PTR' = 'posterior\nthalamic radiation (PTR)',
#   'RLIC' = 'retrolenticular\npart of internal capsule (RLIC)',
#   'SCC' = 'splenium of\ncorpus callosum (SCC)',
#   'SCP' = 'superior\ncerebellar peduncle (SCP)',
#   'SCR' = 'superior\ncorona radiata (SCR)',
#   'SFO' = 'superior fronto-\noccipital fasciculus (SFO)',
#   'SLF' = 'superior\nlongitudinal fasciculus (SLF)',
#   'SS' = 'sagittal stratum (SS)',
#   'UNC' = 'uncinate (UNC)'
# )
# plot_brain_effsize <- groups_effsize %>%
#   # March 6th 2025: drop all CHR_ROP contrasts
#   filter(!contrast %in% 'CHR_ROP')%>%
#   mutate(roi = 
#            factor(roi, levels = 
#                     c('FX_ST', 'ALIC', 'PLIC', 'CP', 'EC', 'CGC', 'CGH', 'PTR', 
#                       'RLIC', 'FX', 'ACR', 'SLF', 'SCP', 'PCR', 'SS',  'GCC', 'SCC')))%>%
#   mutate(groups_sig = factor(contrast, levels = c('HC_ROP', 'CHR_HC')))%>%
#   mutate(contrast = factor(contrast, levels = c('HC_ROP', 'CHR_HC')))%>%
#   ggplot(aes(x = effect.size, y = contrast, fill = contrast))+
#   geom_bar(stat = 'identity', width = 0.8, 
#            position = position_dodge2(width = 0.1, preserve = "single"))+
#   theme_minimal(base_size = 12)+
#   scale_fill_manual(values = c('CHR_HC' = '#98c692',
#                                'HC_ROP' = '#8ba6b1'),
#                     breaks = c('CHR_HC', 'HC_ROP'),
#                     name = 'Group contrasts:\nSign refers to positive effect size',
#                     labels = c('CHR_HC' = 'HC > CHR',
#                                'HC_ROP' = 'HC > ROP'),
#                     guide = guide_legend(title.position = "top"))+
#   facet_grid(roi ~ ., switch = "y", 
#              labeller = labeller(roi = labels_rois)) +
#     # Set axis at both bottom and top
#   scale_x_continuous(sec.axis = dup_axis(name = "Adjusted effect size: measured as Cohen's d")) +
#   theme(
#     strip.placement = "outside",
#     strip.text.y.left = element_text(angle = 0, size = 7.5),
#     axis.title.y = element_blank(),
#     axis.text.y = element_blank(),
#     legend.text = element_text(size = 7),    # Decrease legend text size
#     legend.title = element_text(size = 8),   # Decrease legend title size
#     legend.box.just = "left",                # Ensures the legend box aligns to the left
#     legend.margin = margin(t = -350, r = 0, b = 0, l = 0),
#     #legend.position = 'none',
#     plot.background = element_rect(fill = "white", color = NA),  # Set background to white
#     panel.grid.major = element_blank(),
#     panel.grid.minor = element_blank(),
#     panel.spacing = unit(0.7, "lines"),
#     axis.line.y = element_line(color = "lightgray"),
#     plot.margin = unit(c(1, 1, 1, 1), "cm"), # Controls the plot margin
#     aspect.ratio = 1/8, # Adjust aspect ratio to make the plot narrower
#     axis.title.x.top = element_text(vjust = 1)   # Adjust top x-axis title
#   )+
#   xlab("Adjusted effect size: measured as Cohen's d")+
#   geom_vline(xintercept = seq(-0.2, 0.4, by = 0.1), color = "white")+
#   geom_errorbar(aes(xmin = ci_lower, xmax = ci_upper), 
#                 position = position_dodge(width = 0.9), 
#                 width = 0.2, 
#                 color = "darkgrey")+
#   geom_text(aes(label = ifelse(significance == 'sig', "*", ""), 
#                 x = effect.size + 0.05),  # Adjust 0.1 for space above the bar
#             position = position_dodge(width = 0.9), 
#             vjust = 0.75,
#             size = 16 / .pt)